Towards Building a Collection of Web Archiving Research Articles

Authors

  • Brenda Reyes Ayala
  • Cornelia Caragea
Abstract

The field of Web Archiving exists in a fluid, fragmented, and heterogeneous state. Part of the problem is that this field is relatively new and its literature is scattered across a wide range of journal and conference venues. This makes the state of Web Archiving as a discipline particularly difficult to ascertain. This paper presents an approach to building a collection of articles about the subject. We begin with a small dataset of articles taken from a Web Archiving Bibliography and then proceed to expand it by crawling the Web and collecting additional documents. The crawled documents are then classified using machine learning classification techniques. We show that by extracting the documents’ titles and abstracts and representing them using the “bag of words” approach, we are able to accurately identify documents from the Web crawler as documents that are about Web Archiving. We also discuss our results in the context of Web Archiving as an emerging field.

INTRODUCTION

The field of Web Archiving arose to address fears of a Digital Dark Age, caused by the gradual disappearance of digital information (Kuny, 1997). Many institutions to date have implemented Web Archiving programs, a notable example being the Internet Archive, which in 1996 began to capture snapshots of the entire Web with the purpose of preserving them for future generations. Additionally, many national libraries began archiving their own national domains as part of an effort to preserve their digital cultural heritage. University libraries also followed suit, often looking to expand on the strengths of their existing physical collections.

As a nascent field, Web Archiving exists in an uncertain and continuously evolving state.
Although there have been several efforts to establish standards and coordinate Web Archiving initiatives, for example, through the foundation of the International Internet Preservation Consortium (IIPC) in 2003, the field remains fluid, fragmented, and heterogeneous, and consequently, so does its literature. Published articles on Web Archives are relatively few compared to older, more established disciplines, and are scattered across a wide range of journals and conferences, including the ACM Web Science Conference (WebSci), the Joint Conference on Digital Libraries (JCDL), and D-Lib Magazine. For example, a query for the phrase “Web Archiving” in a well-known database such as Web of Science (Thomson Reuters, 2014) returns 27 results, whereas queries for the phrases “information retrieval” and “information science” in the same database return 880 and 3,586 results, respectively. Authors who do research in Web Archiving generally lack dedicated scholarly journals or publication venues that could provide a sense of the progress or evolution of their field. In short, the state of Web Archiving as a discipline is currently almost impossible to discern.

This fact presents a challenge to a researcher interested in understanding the field: What is the current state of scholarly publication in the field of Web Archiving? The current state of a field cannot be ascertained without a corpus of publications in that field that can be examined. To address the above challenge, we pose our main research question: How do we gather and understand a corpus of Web Archiving research articles, given the scattered nature of the field? In this paper, we present a process, grounded in information retrieval and machine learning techniques, for gathering a corpus of literature about an emerging field.

RELATED WORK

In the Information Science field, there has been much work done on the subject of exploring and analyzing academic disciplines, usually by making use of bibliometric data.
In their prominent study, White and McCain (1998) conducted an extensive domain analysis of the field of Information Science utilizing data from Social Scisearch. They presented a variety of visualizations of the field, such as the most prominent authors, major sub-disciplines, and paradigm shifts over time. Chen (2006) utilized the Java application CiteSpace II to provide an overview of the trends and patterns in the scientific literature of the research fields of mass extinction and terrorism. More recently, Wang and Tang (2013) mapped the development of the emerging field of open innovation using data from Web of Science and CiteSpace II.

We would like to highlight the fact that these research efforts differ from ours in one key factor: the aforementioned authors were working in fields with a strong presence in academic databases and citation indexes. This abundance of bibliometric data and research publications made compiling data and building a corpus substantially less difficult. This is not the case with the field of Web Archiving, and so we were forced to look for alternative ways of building a research corpus, such as employing machine learning techniques for document classification.

ASIST 2014, November 1-4, 2014, Seattle, WA, USA. Copyright is retained by the author(s).

Crawling the Web for relevant articles to assemble a dataset seemed like a potentially effective strategy. In the literature, there have been several studies on focused web crawling, a strategy that collects only Web pages that satisfy some specific property, e.g., they belong to a particular topic. Focused crawling, first proposed by De Bra et al., is a rich area of research on the Web (Bra et al., 1994; Junghoo Cho et al., 1998). Chakrabarti et al. (1999) present a discussion on the main components involved in building a focused crawler.
Bergmark, Lagoze, and Sbityakov (2002) discuss some of the crawling technologies for building document collections as well as ways to make the crawler highly effective. Batsakis, Petrakis, and Milios (2009) propose state-of-the-art crawler strategies that use the content of Web pages as well as the link information in order to estimate the relevance of Web pages tied to specific topics. Other works on focused crawling include Li, Wang, and Du (2013) and Yang, Kang, and Choi (2005).

Wu et al. described the evolution of a crawling strategy for CiteSeer, which is an academic document search engine (Wu et al., 2012a). CiteSeer actively crawls the Web for academic and research documents primarily in Computer and Information Sciences. The authors experimented with using a whitelist (a list of only certain domains that should be crawled) to improve the crawling efficiency of the CiteSeer crawler. They found that crawling the whitelist significantly increased the crawl precision by eliminating a large number of irrelevant requests and downloads. In another study, Wu et al. developed a middleware, the Crawl Document Importer (CDI), which selectively imports documents and their associated metadata to the CiteSeer crawl repository and database. This middleware provides a universal interface to the crawl database and is designed to support input from multiple open source crawlers and archival formats (Wu et al., 2012b).

Caragea et al. (2014) presented a record linkage approach to building a scholarly big dataset, derived from the CiteSeer dataset, which is substantially cleaner than the entire set. More precisely, the authors’ approach was to integrate information from an external data source to remove noise in CiteSeer that results from the automated techniques used for metadata extraction from Web crawled documents.
In contrast to the above works, we make use of information retrieval and machine learning techniques such as focused crawling and text classification to construct a scholarly dataset of Web Archiving research articles. The dataset is available for download to the research community (http://digital.library.unt.edu/ark:/67531/metadc330569/) and will be particularly useful to researchers interested in Web Archiving and newcomers to this field.

BUILDING A COLLECTION OF WEB ARCHIVING RESEARCH ARTICLES

In this section, we present our crawling strategy for building a collection of research articles gathered from the Web that are related to the topic of Web Archiving. We describe the main steps of the crawling process:

1. Compile an initial set of documents related to Web Archiving, which represents the seed set.
2. Similarly, compile a set of documents that are not related to Web Archiving.
3. Train a classifier to accurately discriminate between Web Archiving and non-Web Archiving documents.
4. Extract the authors from the articles related to Web Archiving in our seed set, then crawl the Web using these authors’ names, as well as the names of all their coauthors found, as queries to a generic search engine, and download other research articles that these authors have published previously.
5. Use the classifier trained in Step 3 to automatically identify the documents related to Web Archiving.

We present further details of these steps in what follows. We start with an initial corpus (our seed set) composed of 124 documents about Web Archiving that we extracted from a comprehensive bibliography on the subject (Reyes Ayala, 2013). This bibliography was put together over the course of several months using a variety of methods, such as querying search engines and downloading the publications of prominent authors in the field. We also gathered a separate corpus of randomly chosen documents from many different disciplines.
At the end of this process, we had 124 articles about Web Archiving and 206 randomly chosen articles, for a total of 330 articles. We used a Python library to extract their titles and abstracts. Some documents did not have abstracts, and in such cases, we instead used the document’s first 300 words. The motivation for extracting only the title and abstract from a document was that in many cases, documents on the Web are not available as full text, but only as title and abstract. We refer to this set of documents as Original.

During Steps 1 and 2 of the crawling process, we manually labeled these documents using the following labels: documents about Web Archiving were labeled as positive or +1, while documents on other topics were labeled as negative or -1. Using this labeled dataset, we trained machine learning classifiers to discriminate between the positive and negative documents.

In order to address our research question and discover other documents on the Web that are related to Web Archiving, we employed focused crawling in Step 4. First, from our original small Web Archiving dataset (i.e., our positive seed set), we extracted the authors’ names and their co-authors. We then crawled the Web for these names in order to extract each author’s publications, regardless of their subject of study. We ran several of these crawls, merged the results, and de-duplicated them. The final, merged results from our crawls contained 3,953 items. We refer to this dataset as Crawl. Next, we provide details of our classification task.

Web Archiving Research Paper Identification

We describe our classification task for identifying research articles that are related to the topic of Web Archiving from a collection of documents obtained by crawling the Web.
More precisely, our problem can be formulated as follows: given a crawled document, the task is to classify it into one of two classes: Web Archiving articles (the positive class, denoted as +1) and non-Web Archiving articles (the negative class, denoted as -1).

To address this problem, we represented the documents using the commonly used “bag of words” approach for text classification, used in (McCallum & Nigam, 1998). The “bag of words” approach constructs a vocabulary, which contains all unique words in a collection of documents. A document is then represented as a vector $x$ with as many entries as there are words in the vocabulary, where entry $i$ of $x$, denoted by $x_i$, records the frequency (in the document) of the $i$-th word in the vocabulary. We further represented the documents using tf-idf (term frequency-inverse document frequency). The inverse document frequency is given as $\log \frac{N}{df}$, where $N$ is the number of documents in the collection and $df$ is the document frequency of a term in the collection, i.e., the number of documents that contain that term.

Using these representations, we trained various machine learning classifiers to classify research papers as Web Archiving or not. These classifiers are Support Vector Machines (SVM), Naïve Bayes Multinomial (NBM) and Logistic Regression (LR) (Bishop, 2006).

Experimental Design

Our experiments are designed around the following research questions:

  • What are the units of information (e.g., title, abstract, or both the title and abstract) that most accurately distinguish between documents about Web Archiving and documents about other topics?
  • How well do classifiers trained to identify Web Archiving documents perform “in the wild,” i.e., on a random sample of documents obtained as a result of focused crawling? More precisely, how well do our classifiers generalize to Web crawled documents?
  • What are some of the characteristics of Web Archiving documents obtained by using a focused crawler?
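The bag-of-words and tf-idf representation used for the classification task can be illustrated with a minimal sketch in plain Python. This is not the authors' Weka pipeline: the function names, the whitespace tokenizer, the toy documents, and the choice of the natural logarithm are all assumptions made for illustration, and a real toolkit may tokenize and normalize differently.

```python
import math
from collections import Counter


def tokenize(text):
    # Assumption: lowercase whitespace tokenization; the paper does not
    # specify the tokenizer.
    return text.lower().split()


def build_vocabulary(docs):
    # All unique words in the collection, sorted for a stable ordering.
    return sorted({tok for doc in docs for tok in tokenize(doc)})


def bag_of_words(doc, vocab):
    # Entry i holds the frequency of the i-th vocabulary word in doc.
    counts = Counter(tokenize(doc))
    return [counts[word] for word in vocab]


def inverse_document_frequency(docs, vocab):
    # idf = log(N / df); natural log is an assumption.
    n_docs = len(docs)
    token_sets = [set(tokenize(doc)) for doc in docs]
    weights = []
    for word in vocab:
        df = sum(1 for tokens in token_sets if word in tokens)
        weights.append(math.log(n_docs / df))
    return weights


def tf_idf(doc, vocab, idf_weights):
    # Term frequency scaled by inverse document frequency.
    tf = bag_of_words(doc, vocab)
    return [freq * weight for freq, weight in zip(tf, idf_weights)]


# Hypothetical toy collection standing in for titles and abstracts.
docs = [
    "web archiving preserves the web",
    "crawling the web",
    "protein folding",
]
vocab = build_vocabulary(docs)
idf_weights = inverse_document_frequency(docs, vocab)
vector = tf_idf(docs[0], vocab, idf_weights)
```

In this toy collection, a word that occurs in only one document (such as "archiving") receives a larger idf weight than a word shared by two documents (such as "web"), which is the intended effect of the weighting.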
To answer our first question, we extracted the feature representation for each document using three different units of information (the title, the abstract, and both the title and abstract), and trained and compared several classifiers (SVM, NBM and LR) on these feature representations. We used the Weka implementation (http://www.cs.waikato.ac.nz/ml/weka/) of these classifiers with the default parameters in 10-fold cross-validation experiments.

To answer our second question, we evaluated the best resulting classifiers (from the previous experiment) “in the wild.” Specifically, by construction, the dataset of 330 examples is fairly balanced, i.e., the number of negative examples is only slightly larger than the number of positive ones. However, this is not the case in a real-world scenario, where we expect the number of Web Archiving documents to be only a small fraction of the total number of academic documents on the Web. Hence, the performance of a classifier tested using cross-validation on a fairly balanced set would be overestimated. Note that the goal of our previous experiment was to determine the best feature representation and classifier type for our task.

To perform a more realistic evaluation of our classifiers, we randomly sampled a subset of 500 documents directly from the crawl and manually labeled them as positive or negative. We refer to this dataset as Random. From this dataset, we extracted the documents’ titles and abstracts, and encoded them in the same way as we did for the Original dataset. We then ran the same classification experiments using the Original dataset for training and the Random dataset for testing. Since in our previous experiments the Naïve Bayes classifier yielded the best performance, we used it on the Random dataset.

To evaluate the performance of our models, we report the Accuracy and Precision, Recall, and F-score for the positive class, since we are mainly interested in accurately classifying Web Archiving articles.
These measures are widely used in Information Retrieval applications.

Finally, to answer our third question, we used the best resulting classifier from the first experiment to predict a label for each of the 3,953 documents obtained from our focused crawler (i.e., the Crawl dataset). We extracted the documents’ titles and abstracts and encoded them in the same way as before. We characterize the collection in terms of venue popularity, i.e., the venues containing articles on Web Archiving, as well as proficient authors, i.e., authors who published articles in the field of Web Archiving.
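The evaluation measures reported for the positive class (Accuracy, Precision, Recall, and F-score) can be computed as in the following minimal sketch. The function name and the toy label lists are hypothetical, and the F-score is assumed to be the balanced F1 (harmonic mean of Precision and Recall), which the text does not state explicitly.

```python
def positive_class_metrics(y_true, y_pred, positive=1):
    """Accuracy plus Precision, Recall, and F1 for the positive class."""
    pairs = list(zip(y_true, y_pred))
    # Count the outcomes that matter for the positive (+1) class.
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    correct = sum(1 for t, p in pairs if t == p)

    accuracy = correct / len(pairs)
    # Guard against division by zero when no positives are predicted/present.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    f_score = 2 * precision * recall / denom if denom else 0.0
    return {
        "accuracy": accuracy,
        "precision": precision,
        "recall": recall,
        "f_score": f_score,
    }


# Hypothetical labels: +1 = Web Archiving, -1 = other topics.
y_true = [1, 1, -1, -1, -1, 1]
y_pred = [1, -1, -1, 1, -1, 1]
metrics = positive_class_metrics(y_true, y_pred)
```

Reporting Precision and Recall for the positive class alongside Accuracy matters on an imbalanced sample such as Random, where a classifier that labels every document as negative could still reach a high Accuracy.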


Publication date: 2014